I expect to become more fluent with R tools and the data science workflow. This course seems to be a good place to get an all-around grasp of the basic ways of working with data science tools.
https://github.com/juhopaak/IODS-project
date()
## [1] "Wed Nov 25 20:19:12 2020"
date()
## [1] "Wed Nov 25 20:19:12 2020"
Let’s begin by reading the analysis dataset created in the data wrangling part, and exploring its structure and dimensions.
data <- read.csv("data/learning2014.csv")
dim(data)
## [1] 166 7
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The dataset has 166 observations and 7 variables. The variables are “gender”, “age”, “attitude”, “deep”, “stra”, “surf”, and “points”. The observations are students’ answers to the international Approaches to Learning survey, collected on an introductory statistics course in 2014. The attitude variable is a score indicating the student’s attitude toward statistics, on a scale of 1-5. The deep, stra, and surf variables are average scores of answers to questions about different approaches to learning (deep learning, strategic learning, and surface learning), measured on a Likert scale. The points variable is an integer representing the student’s exam score.
Let’s continue by looking at summaries of the variables and plotting the data.
summary(data)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We can first see that the data includes almost twice as many female students (110) as male students (56).
The youngest students are 17 years old and the oldest 55. The 3rd quartile of age is 27, which means that three quarters of the students are between 17 and 27 years old. We can also see this in the variable’s histogram.
hist(data$age)
The attitude variable is distributed between 1.4 and 5, with half of the answers between 2.6 and 3.7. This variable’s distribution resembles the normal distribution, as displayed below.
hist(data$attitude)
The deep variable ranges from 1.583 to 4.917, but the 1st quartile is 3.333, so three quarters of the answers lie between 3.333 and 4.917. The stra variable is distributed more evenly between 1.250 and 5, but its 1st quartile is 2.625, so most of the values cluster between 2.625 and 3.625. The surf variable, on the other hand, seems somewhat skewed toward lower values, ranging from 1.583 to 4.333 with the 1st quartile at 2.417. The histograms below confirm these observations.
par(mfrow =c(2,2))
hist(data$deep)
hist(data$stra)
hist(data$surf)
The exam points range from 7 to 33, with the 1st quartile at 19, the median at 23, and the 3rd quartile at 27.75, so half of the scores fall between 19 and 27.75. Most of the scores seem to be between 15 and 30, as shown below.
hist(data$points)
Let’s next look at a scatter plot matrix of the data, which displays the pairwise relationships between the variables.
pairs(data[-1], col=data$gender)
Inspecting the scatter plot matrix, the variables most strongly correlated with points seem to be attitude, stra, and surf. Let’s try these three as explanatory variables in a linear model.
model <- lm(points ~ attitude+stra+surf, data = data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the coefficient estimates we can see that a one-unit increase in attitude is associated with a 3.3952-point increase in points, whereas the corresponding effects of stra and surf are 0.8531 and -0.5861, respectively. This means that changes in attitude are more strongly associated with changes in points. In fact, stra and surf are only weakly associated with points, which is also reflected in their p-values: neither variable has a statistically significant relationship (p < 0.05) with points. A large p-value means that data at least as extreme as observed would be quite likely even if the coefficient were truly zero. The p-values of stra and surf thus mean that we have no strong evidence, on the basis of the data, to reject the null hypothesis that stra and surf have no effect on the points variable.
Let’s fit the model again with just the attitude as explanatory variable.
model <- lm(points ~ attitude, data = data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In this model, the coefficient estimates show that attitude predicts the value of points with the effect of 3.5255, which is slightly stronger than in the previous model. Also the standard error of the coefficient is slightly smaller, and the p-value is notably smaller. The results show that attitude is a highly significant predictor of points.
The multiple R-squared gives the proportion of variation in the target variable points that is explained by the model’s explanatory variables, here just attitude. We can see that attitude alone explains around 19% of the variation in points.
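These quantities can also be extracted programmatically instead of being read off the printed summary. A minimal sketch using simulated data (not the course data), so that it runs on its own:

```r
# Simulated stand-in data; with the course data one would query the
# fitted model object from above in the same way.
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
s <- summary(fit)
coef(fit)["x"]                    # slope estimate (close to the true 3)
s$coefficients["x", "Pr(>|t|)"]   # p-value of the slope
s$r.squared                       # proportion of variance explained
```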
Finally, let’s examine diagnostic plots of the model.
par(mfrow =c(2,2))
plot(model, which=c(1,2,5))
The model is based on four important assumptions: linearity, constant variance, normality of residuals, and that the model’s errors are independent. The plots above can be used to examine the validity of the latter three.
The constant variance assumption holds that the size of the model’s errors should not depend on the explanatory variables. It can be checked by plotting the residuals (model errors) against the fitted values: the assumption is reasonable if the points are scattered evenly, with no pattern in their spread. Indeed, we can see from the residuals vs. fitted values plot above that this is the case.
The normality assumption holds that the model’s errors are normally distributed. This assumption can be validated by plotting the residuals against values drawn from the normal distribution. If these values fall roughly along the diagonal line in the Normal Q-Q plot, then the model’s errors are approximately normally distributed.
The residuals vs. leverage plot helps assess whether some observations have a large influence on the estimation of the model’s parameters. If some observations have high leverage, excluding them from the analysis changes the model considerably. In the plot, points falling within the areas delimited by the red dashed line (Cook’s distance) correspond to influential observations. It seems that in this model there are no highly influential observations.
Taken together, these plots suggest that the model’s assumptions are reasonably well satisfied.
date()
## [1] "Wed Nov 25 20:19:13 2020"
Let’s begin by reading the joined dataset from the local directory.
data <- read.csv("data/alc.csv")
dim(data)
## [1] 382 35
As displayed above, the data includes 382 observations and 35 variables. The data describe student achievement in secondary education at two Portuguese schools, collected using school reports and questionnaires. The data used in this analysis combine two datasets covering distinct subjects: Mathematics and Portuguese language. The dataset includes the following variables.
colnames(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this analysis we will study the relationship between high/low alcohol consumption of the students (the variable “high_use”) and four other variables in the data. As the four explanatory variables, let’s select age, studytime, failures, and famrel.
My personal hypothesis is that all these variables will be significantly related to the students’ volume of alcohol consumption. In particular, I expect that (1) the older the student and the more time they use for studying, the less likely they are to consume high volumes of alcohol; and (2) students who have had many class failures and have a low quality of family relationships are more likely to consume a lot of alcohol.
Let’s next explore the distributions of these four variables and their relationship with alcohol consumption.
Let’s first look at histograms of the four explanatory variables.
par(mfrow =c(2,2))
hist(data$age)
hist(data$studytime)
hist(data$failures)
hist(data$famrel)
It seems that most of the students are between ages 15-18, and that most of them use less than 5 hours a week for studying. Further, most of them have had no class failures, and their family relations are generally very good.
If we then look at the summary and a boxplot of the students’ alcohol use, we see that the means of their answers to daily and weekend alcohol use are distributed between 1 and 5, but the majority of answers are between 1 and 2.5. In fact, it seems that the mean score of 5 is an outlier, as we can see from the boxplot.
summary(data$alc_use)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.500 1.889 2.500 5.000
boxplot(data$alc_use)
If we then look at cross-tabulations of the four explanatory variables against the logical variable representing high/low alcohol use, we see that my initial hypotheses are unlikely to hold, at least without qualification.
First, from the table of age against alcohol use, we see that the highest ratios of high to low alcohol use occur in the age groups over 16. This means that the older students consume more alcohol than the younger ones, contrary to what I expected.
table(data$age, data$high_use)
##
## FALSE TRUE
## 15 63 18
## 16 79 28
## 17 64 36
## 18 54 27
## 19 7 4
## 20 1 0
## 22 0 1
From the table of study time against alcohol use, we see that indeed students who study less than 5 hours a week seem to consume more alcohol than those who study more.
table(data$studytime, data$high_use)
##
## FALSE TRUE
## 1 58 42
## 2 135 60
## 3 52 8
## 4 23 4
However, when tabulating past failures against alcohol use, we see that most of the students who have high alcohol consumption have no past failures. This is likely due to the fact that so few students in the data have past failures in the first place.
table(data$failures, data$high_use)
##
## FALSE TRUE
## 0 244 90
## 1 12 12
## 2 10 9
## 3 2 3
Finally, from the table of family relations against alcohol use, we see that most of the students with high alcohol consumption have good family relations - again likely due to most of the students in the data having good relations.
table(data$famrel, data$high_use)
##
## FALSE TRUE
## 1 6 2
## 2 10 9
## 3 39 25
## 4 135 54
## 5 78 24
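Because the group sizes in these tables differ so much, raw counts can mislead; row proportions make the comparison direct. A small sketch of the idea with toy data (with the course data one would pass e.g. `table(data$famrel, data$high_use)` to `prop.table` in the same way):

```r
# Toy table: two groups of different sizes
tab <- table(group = c("a", "a", "a", "b", "b"),
             flag  = c(TRUE, FALSE, FALSE, TRUE, FALSE))
# margin = 1 normalizes each row, giving the share of TRUE/FALSE per group
prop.table(tab, margin = 1)
```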
Let’s now explore the relationship between these variables using logistic regression. The call glm() below fits the model to the alc data. The summary of the model is given after.
model <- glm(high_use ~ age + studytime + failures + famrel, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ age + studytime + failures + famrel,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4449 -0.8466 -0.6790 1.2448 2.2240
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1233 1.7523 -1.212 0.225624
## age 0.1961 0.1023 1.916 0.055335 .
## studytime -0.5466 0.1584 -3.451 0.000559 ***
## failures 0.2855 0.1928 1.480 0.138767
## famrel -0.2541 0.1265 -2.009 0.044537 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.65 on 377 degrees of freedom
## AIC: 446.65
##
## Number of Fisher Scoring iterations: 4
From the results we can see that two variables in the model are significant predictors of high alcohol use, namely studytime and famrel. Of these, studytime is a highly significant predictor. The coefficient estimates of both variables are negative. This means that students who study more are less likely to consume high volumes of alcohol, and likewise students with good family relations are less likely to drink heavily. Each one-unit increase in studytime decreases the log-odds of high alcohol consumption by 0.5466, and each one-unit increase in famrel decreases the log-odds by 0.2541.
The results also show that there is no strong evidence for rejecting the null hypothesis that the students’ age and past class failures are unrelated to their high/low alcohol consumption.
Let’s now compute the odds ratios (OR in the below table) of the model coefficients, and provide confidence intervals for them.
# Odds ratios (OR) from model coefficients
OR <- exp(coef(model))
# Confidence intervals (CI)
CI <- exp(confint(model))
## Waiting for profiling to be done...
# Tabulate results
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1196357 0.003741489 3.6605851
## age 1.2166121 0.997143320 1.4906891
## studytime 0.5789116 0.420249609 0.7831056
## failures 1.3303606 0.911465723 1.9525463
## famrel 0.7756136 0.604403440 0.9941132
The coefficient of studytime has an odds ratio of ~0.58, which means that each additional unit of study time multiplies the odds of high alcohol consumption by roughly 0.58, nearly halving them. For famrel, the odds ratio is ~0.78, so family relations are not as strongly associated with high consumption as study time, but students with good family relations are still more likely to have low alcohol consumption than students with poor relations. We can also see that student age and past failures are positively related to alcohol consumption, but these variables did not have a significant relationship in the model, so we cannot take this as evidence that a relationship actually exists in the data.
The confidence intervals give the range of values within which the true odds ratios are likely to fall, with 95% confidence. That is, if we calculated a confidence interval for, e.g., the strength of the relationship between studytime and high/low alcohol use in 100 repeated samples, the interval would contain the true value about 95 times.
For the coefficient estimate of studytime, the 95% confidence interval is (0.42, 0.78), and for famrel (0.60, 0.99). Both of these ranges preserve the direction of the relationship. For the failures variable, however, the interval is (0.91, 1.95): it is quite wide and, because it contains 1, we cannot say with 95% confidence whether having many failures increases or decreases the odds of high alcohol consumption. The same is true of the age variable, although its interval is not as wide.
These results contradict my initial hypotheses in that age and past failures were not significantly related to high alcohol consumption. Study time and family relations, however, were significantly related, as I expected. Further, study time was a highly significant predictor.
Let’s now evaluate the predictive power of these significant explanatory variables. Let’s start by fitting a model with just these two variables.
model <- glm(high_use ~ studytime + famrel, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ studytime + famrel, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3749 -0.8225 -0.7364 1.3140 2.1048
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3041 0.5700 2.288 0.022139 *
## studytime -0.5946 0.1529 -3.888 0.000101 ***
## famrel -0.2563 0.1243 -2.062 0.039207 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 444.08 on 379 degrees of freedom
## AIC: 450.08
##
## Number of Fisher Scoring iterations: 4
In this model, studytime seems to have an even stronger relationship with alcohol use, while the relationship between famrel and alcohol use remained nearly the same.
Let’s now predict the probability of high_use using this model, and tabulate the results.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
probabilities <- predict(model, type = "response")
# add the predicted probabilities to data
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 265 3
## TRUE 109 5
From the confusion matrix we can see that the model correctly predicts 265 out of 268 cases of low alcohol use, but incorrectly predicts low use for 109 cases of actual high use. Only 5 out of the 114 cases of high use were predicted correctly. Let’s compute the training error to see how the model fares overall.
The following code defines the mean prediction error as the loss function and calculates it for the model’s predictions.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = data$high_use, prob = data$probability)
## [1] 0.2931937
The mean prediction error is 0.2931937, which means that the model correctly predicts high/low alcohol consumption in over 70% of the cases in the data.
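The same figure can be read off the confusion matrix: accuracy is the sum of the diagonal divided by the total, and the training error is its complement. Using the counts from the table above:

```r
# Confusion matrix counts copied from the table above
cm <- matrix(c(265, 109, 3, 5), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)  # (265 + 5) / 382
accuracy
1 - accuracy  # matches the mean prediction error above
```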
To end this exercise, let’s also perform 10-fold cross-validation and compare the results with the model introduced in DataCamp (which had an error of 0.2617801).
# We'll use the package boot for performing cross-validation
library(boot)
# K=10 means 10 folds
cv <- cv.glm(data = data, cost = loss_func, glmfit = model, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293
My model has higher average error with 10-fold cv than the DataCamp model. Better than flipping a coin, though.
date()
## [1] "Wed Nov 25 20:19:14 2020"
Let’s load the Boston data and explore its dimensions. The dataset contains information collected by the U.S. Census Service, about housing in the Boston area.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset has 506 observations and 14 variables. All the variables are numeric, and they describe phenomena such as the per-capita crime rate by town (crim), the proportion of non-retail business acres per town (indus), and the average number of rooms per dwelling (rm).
Let’s now summarize the variables in the data, and then look at a graphical overview of them.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
It seems that the first two variables (crim - crime rate per capita, and zn - proportion of residential land zoned for lots over 25,000 sq.ft) vary over quite a large range, but are strongly skewed towards small values. The rest of the variables seem to be more evenly distributed between their minimum and maximum values.
Let’s now look at a graphical overview of the variables.
pairs(Boston)
It’s pretty hard to get a good overview of so many variables at a glance, so we’d need to do more detailed examinations of the relationships between them. Let’s use the correlation plot to first look at the relationships between all the variables, and then get a better look at the most interesting ones.
library(corrplot)
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper")
It seems that crim is positively correlated with rad and tax, whereas the indus variable is positively correlated with nox and tax, but negatively correlated with dis. Let’s plot these variables against each other.
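The pairs singled out above can also be pulled out of the correlation matrix programmatically. A sketch listing all pairs with |r| > 0.7 (the 0.7 cutoff is an arbitrary choice for illustration):

```r
library(MASS)  # for the Boston data

m <- cor(Boston)
m[upper.tri(m, diag = TRUE)] <- NA  # keep each pair only once
idx <- which(abs(m) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(m)[idx[, 1]],
           var2 = colnames(m)[idx[, 2]],
           r    = round(m[idx], 2))
```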
We’ll begin with crim, rad, and tax, which measure the crime rate per capita by town, the index of accessibility to radial highways, and the full-value property tax rate per $10,000, respectively.
pairs(~crim+rad+tax,data=Boston)
High values of both rad and tax seem to also bring crime rate up, but there’s a break in the relationship in mid values of these variables. Let’s next look at the indus variable against nox, tax and dis. These variables measure the proportion of non-retail business acres per town (indus), nitric oxides concentration (nox), property tax, and the distance to five Boston employment centres (dis).
pairs(~indus+nox+tax+dis, data=Boston)
Interestingly, the proportion of non-retail business acres per town indeed seems to grow with nitric oxides concentration. The positive relationship seems less clear with property tax, while the negative relationship with distance to the Boston employment centres seems quite strong.
Let’s now standardize the dataset and create a categorical variable of crime rate per quantiles in order to predict it using linear discriminant analysis (LDA).
We’ll begin by scaling and summarizing the data again.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The variables now have zero means, and their values have been scaled by each variable’s original standard deviation. The distributions thus tell us how much each variable varies around its mean, in units of standard deviation, which makes the variables directly comparable. For instance, the maximum of the crim variable is now 9.924110, while the maximum of the tax variable is only 1.7964. The positive values of the tax variable lie closer to the variable’s mean, so it has fewer extreme outliers than crim. We can also see this from the boxplots of these variables (plotted below on the original scale).
par(mfrow =c(1,2))
boxplot(Boston$crim, xlab="Boston$crim")
boxplot(Boston$tax, xlab="Boston$tax")
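The standardization performed by scale() is just the familiar z-score transformation: subtract the column mean and divide by the column standard deviation. A quick check on a toy vector (independent of the Boston data):

```r
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)          # manual z-scores
all.equal(as.numeric(scale(x)), z)  # scale() gives the same result
```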
Let’s change the scaled data into a data frame and create a categorical variable of crime rate using the quantiles as break points.
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
# We'll label the values of the new variable "low", "med_low", "med_high", and "high"
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
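The same quantile-binning idea on a transparent toy vector, to show that the quartile break points produce four roughly equal-sized classes:

```r
x <- 1:100
b <- quantile(x)  # min, quartiles, and max as break points
f <- cut(x, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)  # four bins of 25 values each
```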
Finally, let’s replace the old crime variable in the scaled dataset with this new variable and tabulate its values.
boston_scaled$crim <- crime
table(boston_scaled$crim)
##
## low med_low med_high high
## 127 126 126 127
To evaluate our classification, let’s divide the data into sets used for training and testing the classifier.
library(dplyr)
# Randomly choose 80% of the observations and select these for the training set
n <- nrow(boston_scaled)
ind <- sample(n, size=n * 0.8)
train <- boston_scaled[ind,]
# Use the rest for the test set
test <- boston_scaled[-ind,]
Now we’re ready to fit the LDA model to the training data and evaluate it using the test data. Let’s also draw the biplot of the model.
# linear discriminant analysis
lda.fit <- lda(crim ~., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the model results below we can see that the first dimension, which separates the high category from the rest, explains over 94% of the variance in the data (see the proportion of trace at the end of the summary). The second dimension seems to discriminate between the low, med_low, and med_high categories. However, most of the variance in the data seems to lie between the high category and the rest.
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2648515 0.2376238 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 1.0064638 -0.9454999 -0.18944275 -0.8801687 0.42640177 -0.8538872
## med_low -0.1069561 -0.2592784 -0.05155709 -0.5528204 -0.16733006 -0.2713260
## med_high -0.3782608 0.2005473 0.13778554 0.4158727 0.09150459 0.4265675
## high -0.4872402 1.0170298 -0.01233188 1.0659682 -0.43072446 0.8092624
## dis rad tax ptratio black lstat
## low 0.9122440 -0.6917322 -0.6998009 -0.45011801 0.37319756 -0.749808926
## med_low 0.3364431 -0.5471711 -0.4461912 -0.03083151 0.31704690 -0.091922990
## med_high -0.3716017 -0.4315641 -0.3172377 -0.32444631 0.06049457 0.004122183
## high -0.8448767 1.6390172 1.5146914 0.78181164 -0.70789670 0.853442296
## medv
## low 0.48354282
## med_low -0.03750701
## med_high 0.16745601
## high -0.64681709
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.05871507 0.779274050 -0.91488064
## indus 0.08606607 -0.351119851 0.32094992
## chas -0.09266015 0.016498312 0.12074843
## nox 0.32390716 -0.799329669 -1.39593428
## rm -0.13562370 -0.077931010 -0.16478355
## age 0.18636836 -0.303452518 -0.08193777
## dis -0.03140696 -0.386432755 0.15248479
## rad 3.41539384 0.813214662 -0.01810225
## tax -0.00968846 0.177471604 0.49645403
## ptratio 0.08258639 -0.001065723 -0.20946225
## black -0.10750158 0.063539918 0.15332078
## lstat 0.19666791 -0.226037658 0.32894648
## medv 0.17762726 -0.456725843 -0.18084282
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9507 0.0364 0.0129
Let’s now test the model on unseen data. First we’ll save the correct categories of the crime variable and remove them from the test data. Then we’ll predict the test data and display a confusion matrix of the results.
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 15 2 0
## med_low 4 14 1 0
## med_high 0 11 17 2
## high 0 0 0 21
Our classifier correctly predicts 21/21 of the high values in the test data and 14/19 of the med_low values. It does less well with the med_high values (17/30 correct) and especially the low values (only 15/32 correct, with many low observations misclassified as med_low). This might be due to most of the variance in the data lying between the high category and the rest.
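As before, the overall test-set accuracy can be computed from the confusion matrix as the diagonal over the total. Using the counts shown above:

```r
lv <- c("low", "med_low", "med_high", "high")
cm <- matrix(c(15,  4,  0,  0,   # predicted low
               15, 14, 11,  0,   # predicted med_low
                2,  1, 17,  0,   # predicted med_high
                0,  0,  2, 21),  # predicted high
             nrow = 4, dimnames = list(correct = lv, predicted = lv))
sum(diag(cm)) / sum(cm)  # roughly two thirds of test cases correct
```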
To end this exercise we’ll reload the Boston data for k-means clustering. We’ll begin by scaling the data again, and look at a summary to check that everything is in order.
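As a reminder of what scale() does: it subtracts each column’s mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. A minimal check on a toy vector:

```r
# scale() standardizes column by column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```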
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Everything seems to be as before, so we can go on. Next we’ll compute the Euclidean distances between the observations and run k-means (which uses Euclidean distances by default and computes them internally, so the explicit distance matrix is not strictly needed).
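A quick check that dist() really defaults to the Euclidean metric, on two hand-picked points:

```r
# dist() defaults to method = "euclidean": sqrt(sum((a - b)^2))
m <- rbind(a = c(0, 0), b = c(3, 4))
as.numeric(dist(m))                   # 5
sqrt(sum((m["a", ] - m["b", ])^2))    # 5
```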
Let’s first try with 3 clusters.
dist_eu <- dist(boston_scaled)
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Again it’s difficult to get a good overview of so many variables. Let’s look in more detail at the last four.
pairs(boston_scaled[,10:14], col = km$cluster)
The model seems to capture some clusters of values in the data well. For instance, it correctly groups the high values of the tax variable into a single cluster. However, it splits the low tax values between two different clusters, even though those observations lie fairly close together, as the plots show.
Let’s now investigate the optimal number of clusters by checking the elbow plot, using the total within-cluster sum of squares (TWCSS) as the error measure.
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
plot(x = 1:k_max, y = twcss)
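The elbow can also be located programmatically, for instance as the k reached by the largest single drop in TWCSS. This is only a heuristic, sketched here on synthetic data with two well-separated clusters rather than on the Boston data:

```r
# Heuristic elbow: the k reached by the largest one-step drop in TWCSS
set.seed(42)
xy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
            matrix(rnorm(100, mean = 10), ncol = 2))
twcss <- sapply(1:6, function(k) kmeans(xy, k, nstart = 10)$tot.withinss)
elbow <- which.max(-diff(twcss)) + 1
elbow  # 2
```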
From the plot we see that the “elbow” is at two clusters, so we’ll rerun k-means with 2 clusters. Let’s again look at the last four variables.
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,10:14], col = km$cluster)
This model seems to do somewhat better at capturing the different clusters of values in the data. For instance, observations with low values of the tax variable are now much more clearly allocated to the same cluster. However, the model still cannot tell apart, for instance, observations with low values of black from those with high values. This might be because those observations are more clearly separated from each other along some other variable.
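A handy way to judge a clustering against a known grouping is to cross-tabulate the cluster assignments with the group labels, much as we did for the LDA predictions. The Boston data has no ground-truth classes for k-means, so here is a sketch on the built-in iris data instead:

```r
# Cross-tabulate k-means clusters against known species labels (iris)
set.seed(123)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
table(cluster = km_iris$cluster, species = iris$Species)
```

With scaled iris data, one cluster typically captures all 50 setosa observations, while versicolor and virginica end up partly mixed, mirroring the tax situation above.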
date()
## [1] "Wed Nov 25 20:19:26 2020"
Let’s begin by loading the data and again setting the countries as row names.
# Read data from local csv
human <- read.csv("data/human_reduced.csv", header=TRUE)
# Set countries as row names and drop the redundant column
rownames(human) <- human$X
human <- human[2:ncol(human)]
summary(human)
## edu_ratio part_ratio life_expect edu_expect
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## gni maternal_mortality adol_br repr_percent
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
…
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(human, progress=FALSE)
…
library(corrplot)
corrplot(cor(human))
# PCA on the unstandardized data
pca_human <- prcomp(human)
s <- summary(pca_human)
# percentage of variance explained by each principal component
pca_pr <- round(100*s$importance[2,], digits = 1)
# axis labels of the form "PC1 (xx.x%)"
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2])
# Standardize the variables and repeat the PCA
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2,], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2])
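The two biplots differ so much because prcomp() does not scale the variables by default, so a variable with a huge variance (here gni) dominates the first component. A toy demonstration of the effect, using prcomp()’s scale. argument instead of scaling by hand:

```r
# Without scaling, the high-variance column captures nearly all of PC1
set.seed(1)
toy <- data.frame(small = rnorm(100, sd = 1),
                  big   = rnorm(100, sd = 1000))
summary(prcomp(toy))$importance[2, 1]                 # ~1: PC1 is just "big"
summary(prcomp(toy, scale. = TRUE))$importance[2, 1]  # ~0.5: balanced
```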
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
Load the tea dataset from the package FactoMineR. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or on certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
library(FactoMineR)
library(dplyr)
data("tea")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, any_of(keep_columns))
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
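Under the hood, MCA is essentially correspondence analysis of the complete disjunctive table, where every factor level becomes a 0/1 indicator column. A sketch of that coding on two small hypothetical factors (model.matrix with contrasts disabled keeps one column per level):

```r
# Complete disjunctive coding: one 0/1 column per factor level
f <- data.frame(Tea   = factor(c("black", "green", "black")),
                sugar = factor(c("sugar", "No.sugar", "No.sugar")))
ind <- model.matrix(~ . - 1, data = f,
                    contrasts.arg = lapply(f, contrasts, contrasts = FALSE))
ind  # 3 rows, 4 indicator columns; each row sums to the number of variables
```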
…
library(ggplot2)
library(tidyverse)
gather(tea_time) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
…
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
…
plot(mca, invisible=c("ind"), habillage = "quali")
…